Supervised Learning

In [1]:
suppressPackageStartupMessages(library(tidyverse))
In [2]:
options(repr.plot.width=4, repr.plot.height=4)

Regression

In [3]:
set.seed(10)

x <- 1:10
y <- x + rnorm(10, 0.5, 1)
In [4]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
[Figure: scatter plot of x vs y]

Detour: Image formatting in base graphics

Point characters

[Figure: the standard pch point-character symbols]
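
The pch argument selects the plotting symbol (pch = 19, used throughout this notebook, is a solid circle). A minimal sketch for previewing the standard symbols (not in the original notebook):

plot(0:25, rep(1, 26), pch = 0:25, ylim = c(0.5, 1.5),
     axes = FALSE, xlab = "pch value", ylab = "")
text(0:25, 1.2, labels = 0:25, cex = 0.7)  # label each symbol with its pch number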

Linear Regression

In [5]:
model.lm <- lm(y ~ x)

Predicting from a fitted model

In [6]:
predict(model.lm)
       1        2        3        4        5        6        7        8        9       10
1.262084 2.205919 3.149755 4.093590 5.037425 5.981261 6.925096 7.868932 8.812767 9.756602

Predicting for new data

In [7]:
predict(model.lm, data.frame(x = c(1.5, 3.5, 5.5)))
       1        2        3
1.734002 3.621672 5.509343
In [8]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, predict(model.lm), col = 3)
[Figure: scatter plot with the fitted regression line]

Alternative

Note that abline draws the fitted line across the full x-axis limits of the plot, whereas lines only covers the range of the supplied points.

In [9]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
abline(model.lm, col="purple")
[Figure: scatter plot with the abline fit in purple]

Detour: Colors in base graphics

Colors

[Figure: chart of base-graphics colors]

You can also use named colors; run colors() to list all available color names.

In [10]:
colors()
  1. 'white'
  2. 'aliceblue'
  3. 'antiquewhite'
  4. 'antiquewhite1'
  5. 'antiquewhite2'
  6. 'antiquewhite3'
  7. 'antiquewhite4'
  8. 'aquamarine'
  9. 'aquamarine1'
  10. 'aquamarine2'
  11. 'aquamarine3'
  12. 'aquamarine4'
  13. 'azure'
  14. 'azure1'
  15. 'azure2'
  16. 'azure3'
  17. 'azure4'
  18. 'beige'
  19. 'bisque'
  20. 'bisque1'
  21. 'bisque2'
  22. 'bisque3'
  23. 'bisque4'
  24. 'black'
  25. 'blanchedalmond'
  26. 'blue'
  27. 'blue1'
  28. 'blue2'
  29. 'blue3'
  30. 'blue4'
  31. 'blueviolet'
  32. 'brown'
  33. 'brown1'
  34. 'brown2'
  35. 'brown3'
  36. 'brown4'
  37. 'burlywood'
  38. 'burlywood1'
  39. 'burlywood2'
  40. 'burlywood3'
  41. 'burlywood4'
  42. 'cadetblue'
  43. 'cadetblue1'
  44. 'cadetblue2'
  45. 'cadetblue3'
  46. 'cadetblue4'
  47. 'chartreuse'
  48. 'chartreuse1'
  49. 'chartreuse2'
  50. 'chartreuse3'
  51. 'chartreuse4'
  52. 'chocolate'
  53. 'chocolate1'
  54. 'chocolate2'
  55. 'chocolate3'
  56. 'chocolate4'
  57. 'coral'
  58. 'coral1'
  59. 'coral2'
  60. 'coral3'
  61. 'coral4'
  62. 'cornflowerblue'
  63. 'cornsilk'
  64. 'cornsilk1'
  65. 'cornsilk2'
  66. 'cornsilk3'
  67. 'cornsilk4'
  68. 'cyan'
  69. 'cyan1'
  70. 'cyan2'
  71. 'cyan3'
  72. 'cyan4'
  73. 'darkblue'
  74. 'darkcyan'
  75. 'darkgoldenrod'
  76. 'darkgoldenrod1'
  77. 'darkgoldenrod2'
  78. 'darkgoldenrod3'
  79. 'darkgoldenrod4'
  80. 'darkgray'
  81. 'darkgreen'
  82. 'darkgrey'
  83. 'darkkhaki'
  84. 'darkmagenta'
  85. 'darkolivegreen'
  86. 'darkolivegreen1'
  87. 'darkolivegreen2'
  88. 'darkolivegreen3'
  89. 'darkolivegreen4'
  90. 'darkorange'
  91. 'darkorange1'
  92. 'darkorange2'
  93. 'darkorange3'
  94. 'darkorange4'
  95. 'darkorchid'
  96. 'darkorchid1'
  97. 'darkorchid2'
  98. 'darkorchid3'
  99. 'darkorchid4'
  100. 'darkred'
  101. 'darksalmon'
  102. 'darkseagreen'
  103. 'darkseagreen1'
  104. 'darkseagreen2'
  105. 'darkseagreen3'
  106. 'darkseagreen4'
  107. 'darkslateblue'
  108. 'darkslategray'
  109. 'darkslategray1'
  110. 'darkslategray2'
  111. 'darkslategray3'
  112. 'darkslategray4'
  113. 'darkslategrey'
  114. 'darkturquoise'
  115. 'darkviolet'
  116. 'deeppink'
  117. 'deeppink1'
  118. 'deeppink2'
  119. 'deeppink3'
  120. 'deeppink4'
  121. 'deepskyblue'
  122. 'deepskyblue1'
  123. 'deepskyblue2'
  124. 'deepskyblue3'
  125. 'deepskyblue4'
  126. 'dimgray'
  127. 'dimgrey'
  128. 'dodgerblue'
  129. 'dodgerblue1'
  130. 'dodgerblue2'
  131. 'dodgerblue3'
  132. 'dodgerblue4'
  133. 'firebrick'
  134. 'firebrick1'
  135. 'firebrick2'
  136. 'firebrick3'
  137. 'firebrick4'
  138. 'floralwhite'
  139. 'forestgreen'
  140. 'gainsboro'
  141. 'ghostwhite'
  142. 'gold'
  143. 'gold1'
  144. 'gold2'
  145. 'gold3'
  146. 'gold4'
  147. 'goldenrod'
  148. 'goldenrod1'
  149. 'goldenrod2'
  150. 'goldenrod3'
  151. 'goldenrod4'
  152. 'gray'
  153. 'gray0'
  154. 'gray1'
  155. 'gray2'
  156. 'gray3'
  157. 'gray4'
  158. 'gray5'
  159. 'gray6'
  160. 'gray7'
  161. 'gray8'
  162. 'gray9'
  163. 'gray10'
  164. 'gray11'
  165. 'gray12'
  166. 'gray13'
  167. 'gray14'
  168. 'gray15'
  169. 'gray16'
  170. 'gray17'
  171. 'gray18'
  172. 'gray19'
  173. 'gray20'
  174. 'gray21'
  175. 'gray22'
  176. 'gray23'
  177. 'gray24'
  178. 'gray25'
  179. 'gray26'
  180. 'gray27'
  181. 'gray28'
  182. 'gray29'
  183. 'gray30'
  184. 'gray31'
  185. 'gray32'
  186. 'gray33'
  187. 'gray34'
  188. 'gray35'
  189. 'gray36'
  190. 'gray37'
  191. 'gray38'
  192. 'gray39'
  193. 'gray40'
  194. 'gray41'
  195. 'gray42'
  196. 'gray43'
  197. 'gray44'
  198. 'gray45'
  199. 'gray46'
  200. 'gray47'
  201. 'gray48'
  202. 'gray49'
  203. 'gray50'
  204. 'gray51'
  205. 'gray52'
  206. 'gray53'
  207. 'gray54'
  208. 'gray55'
  209. 'gray56'
  210. 'gray57'
  211. 'gray58'
  212. 'gray59'
  213. 'gray60'
  214. 'gray61'
  215. 'gray62'
  216. 'gray63'
  217. 'gray64'
  218. 'gray65'
  219. 'gray66'
  220. 'gray67'
  221. 'gray68'
  222. 'gray69'
  223. 'gray70'
  224. 'gray71'
  225. 'gray72'
  226. 'gray73'
  227. 'gray74'
  228. 'gray75'
  229. 'gray76'
  230. 'gray77'
  231. 'gray78'
  232. 'gray79'
  233. 'gray80'
  234. 'gray81'
  235. 'gray82'
  236. 'gray83'
  237. 'gray84'
  238. 'gray85'
  239. 'gray86'
  240. 'gray87'
  241. 'gray88'
  242. 'gray89'
  243. 'gray90'
  244. 'gray91'
  245. 'gray92'
  246. 'gray93'
  247. 'gray94'
  248. 'gray95'
  249. 'gray96'
  250. 'gray97'
  251. 'gray98'
  252. 'gray99'
  253. 'gray100'
  254. 'green'
  255. 'green1'
  256. 'green2'
  257. 'green3'
  258. 'green4'
  259. 'greenyellow'
  260. 'grey'
  261. 'grey0'
  262. 'grey1'
  263. 'grey2'
  264. 'grey3'
  265. 'grey4'
  266. 'grey5'
  267. 'grey6'
  268. 'grey7'
  269. 'grey8'
  270. 'grey9'
  271. 'grey10'
  272. 'grey11'
  273. 'grey12'
  274. 'grey13'
  275. 'grey14'
  276. 'grey15'
  277. 'grey16'
  278. 'grey17'
  279. 'grey18'
  280. 'grey19'
  281. 'grey20'
  282. 'grey21'
  283. 'grey22'
  284. 'grey23'
  285. 'grey24'
  286. 'grey25'
  287. 'grey26'
  288. 'grey27'
  289. 'grey28'
  290. 'grey29'
  291. 'grey30'
  292. 'grey31'
  293. 'grey32'
  294. 'grey33'
  295. 'grey34'
  296. 'grey35'
  297. 'grey36'
  298. 'grey37'
  299. 'grey38'
  300. 'grey39'
  301. 'grey40'
  302. 'grey41'
  303. 'grey42'
  304. 'grey43'
  305. 'grey44'
  306. 'grey45'
  307. 'grey46'
  308. 'grey47'
  309. 'grey48'
  310. 'grey49'
  311. 'grey50'
  312. 'grey51'
  313. 'grey52'
  314. 'grey53'
  315. 'grey54'
  316. 'grey55'
  317. 'grey56'
  318. 'grey57'
  319. 'grey58'
  320. 'grey59'
  321. 'grey60'
  322. 'grey61'
  323. 'grey62'
  324. 'grey63'
  325. 'grey64'
  326. 'grey65'
  327. 'grey66'
  328. 'grey67'
  329. 'grey68'
  330. 'grey69'
  331. 'grey70'
  332. 'grey71'
  333. 'grey72'
  334. 'grey73'
  335. 'grey74'
  336. 'grey75'
  337. 'grey76'
  338. 'grey77'
  339. 'grey78'
  340. 'grey79'
  341. 'grey80'
  342. 'grey81'
  343. 'grey82'
  344. 'grey83'
  345. 'grey84'
  346. 'grey85'
  347. 'grey86'
  348. 'grey87'
  349. 'grey88'
  350. 'grey89'
  351. 'grey90'
  352. 'grey91'
  353. 'grey92'
  354. 'grey93'
  355. 'grey94'
  356. 'grey95'
  357. 'grey96'
  358. 'grey97'
  359. 'grey98'
  360. 'grey99'
  361. 'grey100'
  362. 'honeydew'
  363. 'honeydew1'
  364. 'honeydew2'
  365. 'honeydew3'
  366. 'honeydew4'
  367. 'hotpink'
  368. 'hotpink1'
  369. 'hotpink2'
  370. 'hotpink3'
  371. 'hotpink4'
  372. 'indianred'
  373. 'indianred1'
  374. 'indianred2'
  375. 'indianred3'
  376. 'indianred4'
  377. 'ivory'
  378. 'ivory1'
  379. 'ivory2'
  380. 'ivory3'
  381. 'ivory4'
  382. 'khaki'
  383. 'khaki1'
  384. 'khaki2'
  385. 'khaki3'
  386. 'khaki4'
  387. 'lavender'
  388. 'lavenderblush'
  389. 'lavenderblush1'
  390. 'lavenderblush2'
  391. 'lavenderblush3'
  392. 'lavenderblush4'
  393. 'lawngreen'
  394. 'lemonchiffon'
  395. 'lemonchiffon1'
  396. 'lemonchiffon2'
  397. 'lemonchiffon3'
  398. 'lemonchiffon4'
  399. 'lightblue'
  400. 'lightblue1'
  401. 'lightblue2'
  402. 'lightblue3'
  403. 'lightblue4'
  404. 'lightcoral'
  405. 'lightcyan'
  406. 'lightcyan1'
  407. 'lightcyan2'
  408. 'lightcyan3'
  409. 'lightcyan4'
  410. 'lightgoldenrod'
  411. 'lightgoldenrod1'
  412. 'lightgoldenrod2'
  413. 'lightgoldenrod3'
  414. 'lightgoldenrod4'
  415. 'lightgoldenrodyellow'
  416. 'lightgray'
  417. 'lightgreen'
  418. 'lightgrey'
  419. 'lightpink'
  420. 'lightpink1'
  421. 'lightpink2'
  422. 'lightpink3'
  423. 'lightpink4'
  424. 'lightsalmon'
  425. 'lightsalmon1'
  426. 'lightsalmon2'
  427. 'lightsalmon3'
  428. 'lightsalmon4'
  429. 'lightseagreen'
  430. 'lightskyblue'
  431. 'lightskyblue1'
  432. 'lightskyblue2'
  433. 'lightskyblue3'
  434. 'lightskyblue4'
  435. 'lightslateblue'
  436. 'lightslategray'
  437. 'lightslategrey'
  438. 'lightsteelblue'
  439. 'lightsteelblue1'
  440. 'lightsteelblue2'
  441. 'lightsteelblue3'
  442. 'lightsteelblue4'
  443. 'lightyellow'
  444. 'lightyellow1'
  445. 'lightyellow2'
  446. 'lightyellow3'
  447. 'lightyellow4'
  448. 'limegreen'
  449. 'linen'
  450. 'magenta'
  451. 'magenta1'
  452. 'magenta2'
  453. 'magenta3'
  454. 'magenta4'
  455. 'maroon'
  456. 'maroon1'
  457. 'maroon2'
  458. 'maroon3'
  459. 'maroon4'
  460. 'mediumaquamarine'
  461. 'mediumblue'
  462. 'mediumorchid'
  463. 'mediumorchid1'
  464. 'mediumorchid2'
  465. 'mediumorchid3'
  466. 'mediumorchid4'
  467. 'mediumpurple'
  468. 'mediumpurple1'
  469. 'mediumpurple2'
  470. 'mediumpurple3'
  471. 'mediumpurple4'
  472. 'mediumseagreen'
  473. 'mediumslateblue'
  474. 'mediumspringgreen'
  475. 'mediumturquoise'
  476. 'mediumvioletred'
  477. 'midnightblue'
  478. 'mintcream'
  479. 'mistyrose'
  480. 'mistyrose1'
  481. 'mistyrose2'
  482. 'mistyrose3'
  483. 'mistyrose4'
  484. 'moccasin'
  485. 'navajowhite'
  486. 'navajowhite1'
  487. 'navajowhite2'
  488. 'navajowhite3'
  489. 'navajowhite4'
  490. 'navy'
  491. 'navyblue'
  492. 'oldlace'
  493. 'olivedrab'
  494. 'olivedrab1'
  495. 'olivedrab2'
  496. 'olivedrab3'
  497. 'olivedrab4'
  498. 'orange'
  499. 'orange1'
  500. 'orange2'
  501. 'orange3'
  502. 'orange4'
  503. 'orangered'
  504. 'orangered1'
  505. 'orangered2'
  506. 'orangered3'
  507. 'orangered4'
  508. 'orchid'
  509. 'orchid1'
  510. 'orchid2'
  511. 'orchid3'
  512. 'orchid4'
  513. 'palegoldenrod'
  514. 'palegreen'
  515. 'palegreen1'
  516. 'palegreen2'
  517. 'palegreen3'
  518. 'palegreen4'
  519. 'paleturquoise'
  520. 'paleturquoise1'
  521. 'paleturquoise2'
  522. 'paleturquoise3'
  523. 'paleturquoise4'
  524. 'palevioletred'
  525. 'palevioletred1'
  526. 'palevioletred2'
  527. 'palevioletred3'
  528. 'palevioletred4'
  529. 'papayawhip'
  530. 'peachpuff'
  531. 'peachpuff1'
  532. 'peachpuff2'
  533. 'peachpuff3'
  534. 'peachpuff4'
  535. 'peru'
  536. 'pink'
  537. 'pink1'
  538. 'pink2'
  539. 'pink3'
  540. 'pink4'
  541. 'plum'
  542. 'plum1'
  543. 'plum2'
  544. 'plum3'
  545. 'plum4'
  546. 'powderblue'
  547. 'purple'
  548. 'purple1'
  549. 'purple2'
  550. 'purple3'
  551. 'purple4'
  552. 'red'
  553. 'red1'
  554. 'red2'
  555. 'red3'
  556. 'red4'
  557. 'rosybrown'
  558. 'rosybrown1'
  559. 'rosybrown2'
  560. 'rosybrown3'
  561. 'rosybrown4'
  562. 'royalblue'
  563. 'royalblue1'
  564. 'royalblue2'
  565. 'royalblue3'
  566. 'royalblue4'
  567. 'saddlebrown'
  568. 'salmon'
  569. 'salmon1'
  570. 'salmon2'
  571. 'salmon3'
  572. 'salmon4'
  573. 'sandybrown'
  574. 'seagreen'
  575. 'seagreen1'
  576. 'seagreen2'
  577. 'seagreen3'
  578. 'seagreen4'
  579. 'seashell'
  580. 'seashell1'
  581. 'seashell2'
  582. 'seashell3'
  583. 'seashell4'
  584. 'sienna'
  585. 'sienna1'
  586. 'sienna2'
  587. 'sienna3'
  588. 'sienna4'
  589. 'skyblue'
  590. 'skyblue1'
  591. 'skyblue2'
  592. 'skyblue3'
  593. 'skyblue4'
  594. 'slateblue'
  595. 'slateblue1'
  596. 'slateblue2'
  597. 'slateblue3'
  598. 'slateblue4'
  599. 'slategray'
  600. 'slategray1'
  601. 'slategray2'
  602. 'slategray3'
  603. 'slategray4'
  604. 'slategrey'
  605. 'snow'
  606. 'snow1'
  607. 'snow2'
  608. 'snow3'
  609. 'snow4'
  610. 'springgreen'
  611. 'springgreen1'
  612. 'springgreen2'
  613. 'springgreen3'
  614. 'springgreen4'
  615. 'steelblue'
  616. 'steelblue1'
  617. 'steelblue2'
  618. 'steelblue3'
  619. 'steelblue4'
  620. 'tan'
  621. 'tan1'
  622. 'tan2'
  623. 'tan3'
  624. 'tan4'
  625. 'thistle'
  626. 'thistle1'
  627. 'thistle2'
  628. 'thistle3'
  629. 'thistle4'
  630. 'tomato'
  631. 'tomato1'
  632. 'tomato2'
  633. 'tomato3'
  634. 'tomato4'
  635. 'turquoise'
  636. 'turquoise1'
  637. 'turquoise2'
  638. 'turquoise3'
  639. 'turquoise4'
  640. 'violet'
  641. 'violetred'
  642. 'violetred1'
  643. 'violetred2'
  644. 'violetred3'
  645. 'violetred4'
  646. 'wheat'
  647. 'wheat1'
  648. 'wheat2'
  649. 'wheat3'
  650. 'wheat4'
  651. 'whitesmoke'
  652. 'yellow'
  653. 'yellow1'
  654. 'yellow2'
  655. 'yellow3'
  656. 'yellow4'
  657. 'yellowgreen'
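
Integer color values, like the col = 3 used above, index into the current palette (a quick check, not in the original notebook):

palette()  # col = 3 refers to the third entry of this vector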

Polynomial Regression

In [11]:
model.poly5 <- lm(y ~ poly(x, 5))
In [12]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, predict(model.poly5), col = "orange")
[Figure: scatter plot with the degree-5 polynomial fit in orange]
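
By default poly builds orthogonal polynomial terms, which keeps the design matrix numerically well behaved; raw powers give the same fitted curve with different coefficients. A quick check (not in the original notebook):

model.poly5.raw <- lm(y ~ poly(x, 5, raw = TRUE))
all.equal(fitted(model.poly5), fitted(model.poly5.raw))  # TRUE: identical fitted values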

Spline Regression

Splines are essentially piecewise polynomial fits. Two parameters control them:

  • degree determines the type of piecewise polynomial used (e.g. degree = 3 uses cubic polynomials)
  • knots gives the x-locations where the pieces meet (and hence determines the number of pieces)
In [13]:
library(splines)
In [14]:
model.spl <- lm(y ~ bs(x, degree=3, knots=4))
In [15]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, predict(model.spl), col = 3)
[Figure: scatter plot with the cubic spline fit]
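
Changing the knots changes where the cubic pieces join, and adding knots makes the fit more flexible; the knot locations below are arbitrary choices for illustration (not in the original notebook):

model.spl2 <- lm(y ~ bs(x, degree = 3, knots = c(3, 5, 7)))  # three interior knots
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, predict(model.spl2), col = "blue")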

Connect the dots

In [16]:
plot(x, y, xlim = c(0, 10), ylim = c(0, 10), pch = 19)
lines(x, y, col="red")
[Figure: scatter plot with the points connected by line segments]

Classification

In [17]:
levels(iris$Species)
  1. 'setosa'
  2. 'versicolor'
  3. 'virginica'
In [18]:
df <- iris %>% filter(Species != "setosa")
df$Species <- droplevels(df$Species)
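
Dropping the unused setosa level matters because model and plotting functions key off the full set of factor levels; a quick check (not in the original notebook):

levels(df$Species)  # only 'versicolor' and 'virginica' remain
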
In [19]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,2))
plot(df[,1], df[,2], col=as.integer(df$Species))
plot(df[,3], df[,4], col=as.integer(df$Species))
[Figure: Sepal.Length vs Sepal.Width and Petal.Length vs Petal.Width, colored by species]
In [20]:
sample.int(10, 5)
  1. 9
  2. 6
  3. 7
  4. 3
  5. 8

Split into 50% training and 50% test sets

In [21]:
library(class)
In [22]:
set.seed(10)
sample <- sample.int(n = nrow(df), size = floor(.5*nrow(df)), replace = FALSE)
train <- df[sample, 1:4]
test  <- df[-sample, 1:4]
train.cls <- df$Species[sample]
test.cls  <- df$Species[-sample]
In [23]:
test.pred <- knn(train, test, train.cls, k = 3)
In [24]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,2))
plot(test[,1], test[,2], col=as.integer(test.cls), main="Truth")
plot(test[,1], test[,2], col=as.integer(test.pred), main="Predicted")
[Figure: true vs kNN-predicted class labels on the test set]
In [25]:
table(test.pred, test.cls)
            test.cls
test.pred    versicolor virginica
  versicolor         25         0
  virginica           3        22
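
The off-diagonal entries of the table are the errors; a one-line summary (not in the original notebook):

mean(test.pred != test.cls)  # test-set misclassification rate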

Logistic Regression

In [26]:
head(train)
Sepal.LengthSepal.WidthPetal.LengthPetal.Width
516.33.36.02.5
315.52.43.81.1
426.13.04.61.4
687.73.86.72.2
96.62.94.61.3
226.12.84.01.3

The warnings below arise because vanilla logistic regression does not cope with perfectly separated classes: the maximum-likelihood coefficient estimates diverge. The usual remedy is to add a penalization term (a ridge sketch follows the warnings).

In [27]:
model.logistic <- glm(train.cls ~ .,
                      family=binomial(link='logit'), data=train)
Warning message:
“glm.fit: algorithm did not converge”
Warning message:
“glm.fit: fitted probabilities numerically 0 or 1 occurred”
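
A minimal sketch of the penalized approach using the glmnet package (the package, the ridge penalty alpha = 0, and the lambda value are choices made here, not part of the original notebook; in practice lambda would be chosen with cv.glmnet):

suppressPackageStartupMessages(library(glmnet))
# ridge-penalized logistic regression keeps the coefficients finite
# even when the classes are perfectly separable
model.ridge <- glmnet(as.matrix(train), train.cls,
                      family = "binomial", alpha = 0, lambda = 0.1)
probs <- predict(model.ridge, newx = as.matrix(test), type = "response")
table(ifelse(probs < 0.5, "versicolor", "virginica"), test.cls)
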
In [28]:
summary(model.logistic)

Call:
glm(formula = train.cls ~ ., family = binomial(link = "logit"),
    data = train)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.347e-05  -2.110e-08   2.110e-08   2.110e-08   2.792e-05

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.658e+00  1.263e+06   0.000    1.000
Sepal.Length -6.152e+01  9.061e+04  -0.001    0.999
Sepal.Width  -8.214e+01  4.829e+05   0.000    1.000
Petal.Length  7.759e+01  1.673e+05   0.000    1.000
Petal.Width   1.453e+02  2.148e+05   0.001    0.999

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.8593e+01  on 49  degrees of freedom
Residual deviance: 2.0984e-09  on 45  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25

In [29]:
test.pred <- ifelse(predict(model.logistic, test) < 0, 1, 2)  # predict() returns log-odds; < 0 means p < 0.5
In [30]:
options(repr.plot.width=8, repr.plot.height=4)
par(mfrow=c(1,2))
plot(test[,1], test[,2], col=as.integer(test.cls), main="Truth")
plot(test[,1], test[,2], col=test.pred, main="Predicted")
[Figure: true vs logistic-regression-predicted class labels on the test set]
In [31]:
table(test.pred, test.cls)
         test.cls
test.pred versicolor virginica
        1         24         3
        2          4        19

Overfitting

In [32]:
set.seed(10)

x <- 1:10
y <- 0.1*x^2 + 0.2*x + rnorm(10, 0.5, 1)
In [33]:
m1 <- lm(y ~ x)
m2 <- lm(y ~ poly(x, 2))
m5 <- lm(y ~ poly(x, 5))
m9 <- lm(y ~ poly(x, 9))
In [34]:
options(repr.plot.width=8, repr.plot.height=8)

xp <- seq(0, 10, length.out = 100)
df <- data.frame(x=xp)

par(mfrow=c(2,2))
plot(x, y, xlim = c(0, 10), pch = 19)
lines(xp, predict(m1, df), col = "red")
plot(x, y, xlim = c(0, 10), pch = 19)
lines(xp, predict(m2, df), col = "red")
plot(x, y, xlim = c(0, 10), pch = 19)
lines(xp, predict(m5, df), col = "red")
plot(x, y, xlim = c(0, 10), pch = 19)
lines(xp, predict(m9, df), col = "red")
[Figure: the four polynomial fits (degrees 1, 2, 5, 9)]
In [35]:
# training RSS for each model
for (model in list(m1, m2, m5, m9)) {
    print(round(sum((predict(model, data.frame(x=x)) - y)^2), 2))
}
[1] 9.18
[1] 4.15
[1] 2.09
[1] 0

The degree-9 polynomial has as many coefficients as there are data points, so it interpolates the training data exactly and its training RSS is zero.
In [36]:
test.x <- runif(10, 0, 10)
test.y <- 0.1*test.x^2 + 0.2*test.x + rnorm(10, 0.5, 3)
In [37]:
# RSS on the new test data
for (model in list(m1, m2, m5, m9)) {
    print(round(sum((predict(model, data.frame(x=test.x)) - test.y)^2), 2))
}
[1] 112.25
[1] 103.79
[1] 99.8
[1] 111.98

On new data the extra flexibility buys nothing: the degree-9 model does about as badly as the straight line, while the quadratic and quintic fits do best here.

Cross-validation

In [38]:
for (k in 1:5) {
    rss <- 0
    for (i in 1:10) {
        # leave observation i out, fit a degree-k polynomial, score the held-out point
        xmo <- x[-i]
        ymo <- y[-i]
        model <- lm(ymo ~ poly(xmo, k))
        res <- (predict(model, data.frame(xmo=x[i])) - y[i])^2
        rss <- rss + res
    }
    print(k)
    print(rss)
}
[1] 1
       1
17.61035
[1] 2
      1
8.68587
[1] 3
       1
19.42014
[1] 4
       1
20.58078
[1] 5
       1
50.26568
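
The same leave-one-out computation can be wrapped in a small helper (loocv.rss is a name introduced here, not from the original notebook):

loocv.rss <- function(k) {
    # sum of leave-one-out squared errors for a degree-k polynomial fit
    sum(sapply(seq_along(x), function(i) {
        xmo <- x[-i]
        ymo <- y[-i]
        model <- lm(ymo ~ poly(xmo, k))
        (predict(model, data.frame(xmo = x[i])) - y[i])^2
    }))
}
round(sapply(1:5, loocv.rss), 2)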

Feature Selection

In [39]:
suppressPackageStartupMessages(library(genefilter))
In [40]:
set.seed(123)
n <- 20    # patients per group (40 in total)
m <- 1000  # genes
EXPRS <- matrix(rnorm(2 * n * m), 2 * n, m)
rownames(EXPRS) <- paste("pt", 1:(2 * n), sep = "")
colnames(EXPRS) <- paste("g", 1:m, sep = "")
grp <- as.factor(rep(1:2, c(n, n)))
In [41]:
head(EXPRS, 3)
     g1          g2          g3           g4          g5         g6          g7          g8          g9          g10         ⋯  g991        g992       g993        g994        g995        g996        g997        g998        g999        g1000
pt1  -0.5604756  -0.6947070   0.005764186  0.1176466   1.052711   2.1988103  -0.7886220  -1.6674751   0.2374303  -0.2052993  ⋯   0.3780725   1.974814  -0.4535280  -0.4552866  -2.3004639  -0.3804398   0.2870161  -0.2018602  -1.6727583  1.1379048
pt2  -0.2301775  -0.2079173   0.385280401 -0.9474746  -1.049177   1.3124130  -0.5021987   0.7364960   1.2181086   0.6511933  ⋯   0.5981352  -1.021826  -2.0371182   1.5636880  -0.9501855   0.6671640  -0.6702249   1.1181721  -0.5414325  1.2684239
pt3   1.5587083  -1.2653964  -0.370660032 -0.4905574  -1.260155  -0.2651451   1.4960607   0.3860266  -1.3387743   0.2737665  ⋯   0.5774870   0.853561  -0.3030158  -0.1434414  -0.8478627   0.2413405  -0.5417177   0.1625707   0.1995339  0.0427062
In [42]:
stats <- abs(rowttests(t(EXPRS), grp)$statistic)  # per-gene two-sample t statistics
In [43]:
head(stats,3)
  1. 0.674624258566226
  2. 0.741717473727548
  3. 3.02542265710511
In [44]:
ii <- order(-stats)
In [45]:
TOPEXPRS <- EXPRS[, ii[1:10]]
In [46]:
mod0 = knn(train = TOPEXPRS, test = TOPEXPRS, cl = grp, k = 3)
table(mod0, grp)
     grp
mod0  1  2
   1 17  0
   2  3 20

Note: feature selection was not part of the CV process here (the top genes were chosen using all the data), so the results are OVER-OPTIMISTIC.

In [47]:
mod1 <- knn.cv(TOPEXPRS, grp, k = 3)  # leave-one-out CV, but on pre-selected features
table(mod1, grp)
     grp
mod1  1  2
   1 16  0
   2  4 20
In [48]:
suppressPackageStartupMessages(library(multtest))
In [49]:
top.features <- function(EXP, resp, test, fsnum) {
    # EXP is genes x patients; rank the genes with patient i held out,
    # so that feature selection becomes part of the CV loop
    top.features.i <- function(i, EXP, resp, test, fsnum) {
        stats <- abs(mt.teststat(EXP[, -i], resp[-i], test = test))
        ii <- order(-stats)[1:fsnum]
        rownames(EXP)[ii]
    }
    # one column of selected gene names per held-out patient
    sapply(1:ncol(EXP), top.features.i,
           EXP = EXP, resp = resp, test = test, fsnum = fsnum)
}
In [50]:
# This function evaluates the knn with leave-one-out CV,
# re-selecting the features for each held-out patient
knn.loocv <- function(EXP, resp, test, k, fsnum, tabulate = FALSE, permute = FALSE) {
    if (permute) {
        resp <- sample(resp)  # shuffle labels to get a null reference
    }
    topfeat <- top.features(EXP, resp, test, fsnum)
    pids <- rownames(EXP)
    EXP <- t(EXP)  # back to patients x genes
    colnames(EXP) <- as.character(pids)
    knn.loocv.i <- function(i, EXP, resp, k, topfeat) {
        ii <- topfeat[, i]  # features selected with patient i held out
        knn(train = EXP[-i, ii], test = EXP[i, ii], cl = resp[-i], k = k)[1]
    }
    out <- sapply(1:nrow(EXP), knn.loocv.i,
                  EXP = EXP, resp = resp, k = k, topfeat = topfeat)
    if (tabulate)
        out <- ftable(pred = out, obs = resp)
    return(out)
}

Reminder of what the data look like

In [51]:
EXPRS[1:5, 1:5]
     g1           g2          g3            g4          g5
pt1  -0.56047565  -0.6947070   0.005764186   0.1176466   1.0527115
pt2  -0.23017749  -0.2079173   0.385280401  -0.9474746  -1.0491770
pt3   1.55870831  -1.2653964  -0.370660032  -0.4905574  -1.2601552
pt4   0.07050839   2.1689560   0.644376549  -0.2560922   3.2410399
pt5   0.12928774   1.2079620  -0.220486562   1.8438620  -0.4168576
In [52]:
levels(grp)
  1. '1'
  2. '2'
In [53]:
knn.loocv(t(EXPRS), grp, "t.equalvar", 3, 10, TRUE)
     obs  1  2
pred
1         7  7
2        13 13
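
The permute argument of knn.loocv gives a null reference: shuffling the labels destroys any real class signal, so the cross-validated performance should drop to roughly chance level. A usage sketch (the seed is an arbitrary choice here):

set.seed(42)
knn.loocv(t(EXPRS), grp, "t.equalvar", 3, 10, tabulate = TRUE, permute = TRUE)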

Cross-validation done right (simple version)

In [54]:
pred <- numeric(2*n)
for (i in 1:(2*n)) {
    # re-rank the genes with patient i held out, then select the top 10
    stats <- abs(rowttests(t(EXPRS[-i,]), grp[-i])$statistic)
    ii <- order(-stats)
    TOPEXPRS <- EXPRS[-i, ii[1:10]]
    pred[i] <- knn(TOPEXPRS, EXPRS[i, ii[1:10]], grp[-i], k = 3)
}
In [55]:
table(pred, grp)
    grp
pred  1  2
   1  7  7
   2 13 13

Exercise

  • Repeat the last experiment with a noisy quantitative outcome
  • First simulate a data matrix with n = 50 patients and m genes
  • Next, draw the outcome for the n = 50 patients from a standard normal distribution, independent of the data matrix
  • By design, there is no relationship between the expression values and the outcome
  • We consider m = 45 and m = 50000
  • We conduct naive LOOCV using the top 10 features

Data Generation

In [56]:
set.seed(123)
n <- 50
m <- 50000
EXPRS <- matrix(rnorm(n*m), n, m)
rownames(EXPRS) <- paste("pt", 1:n, sep = "")
colnames(EXPRS) <- paste("g", 1:m, sep = "")
OUTCOMES <- rnorm(n)

Proper cross-validation

In [57]:
pred0 <- numeric(n)
for (i in 1:n) {
    # select the top 10 genes using only the n - 1 training patients
    stats <- abs(cor(OUTCOMES[-i], EXPRS[-i,]))
    ii <- order(-stats)
    TOPEXPRS <- EXPRS[, ii[1:10]]

    data <- data.frame(y = OUTCOMES, x = I(TOPEXPRS))
    model <- lm(y ~ x, data=data[-i,])
    pred0[i] <- predict(model, data[i,])
}

Naive cross-validation

In [58]:
# features selected once, using all patients: this is the information leak
stats <- abs(cor(OUTCOMES, EXPRS))
ii <- order(-stats)
TOPEXPRS <- EXPRS[, ii[1:10]]
In [59]:
pred <- numeric(n)
for (i in 1:n) {
    # TOPEXPRS was chosen using the held-out patient, so information leaks in
    data <- data.frame(y = OUTCOMES, x = I(TOPEXPRS))
    model <- lm(y ~ x, data=data[-i,])
    pred[i] <- predict(model, data[i,])
}
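
Since the outcome is pure noise, a well-behaved procedure should show no association between predictions and outcomes; comparing the two estimates makes the selection bias visible (a quick check, not in the original notebook):

c(naive = cor(OUTCOMES, pred), proper = cor(OUTCOMES, pred0))
# the naive predictions typically correlate strongly with the noise outcome,
# while the properly cross-validated ones do not
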
In [60]:
plot(OUTCOMES, pred, col='red', xlab="Observed Values", ylab="Predicted Values")
points(OUTCOMES, pred0, pch=4, col='blue')
abline(lm(pred ~ OUTCOMES), col='red')
[Figure: observed vs predicted values for naive (red) and proper (blue) cross-validation]